DS 6030 | Fall 2021 | University of Virginia

NOTE TO GRADER: I collaborated with Geoff Hansen on problem 2, and referenced Class 16 Notes for all problems. I often used starter code from these class notes as well.


Required R packages and Directories

data.dir = 'https://mdporter.github.io/DS6030/data/' # data directory
library(R6030)     # functions for DS-6030
library(tidyverse) # functions for data manipulation   
library(mclust)    # functions for mixture models
library(mixtools)  # poisregmixEM() function

Problem 1: Customer Segmentation with RFM (Recency, Frequency, and Monetary Value)

RFM analysis is an approach that some businesses use to understand their customers’ activities. At any point in time, a company can measure how recently a customer purchased a product (Recency), how many times they purchased a product (Frequency), and how much they have spent (Monetary Value). There are many ad-hoc attempts to segment/cluster customers based on the RFM scores (e.g., here is one based on using the customers’ rank of each dimension independently: https://joaocorreia.io/blog/rfm-analysis-increase-sales-by-segmenting-your-customers.html). In this problem you will use the clustering methods we covered in class to segment the customers.

The data for this problem can be found here: https://mdporter.github.io/DS6030/data//RFM.csv. Cluster based on the Recency, Frequency, and Monetary value columns.

## Read the data

file=file.path(data.dir, "RFM.csv")
df=read_csv(file)
#> Rows: 10000 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (4): id, Recency, Frequency, Monetary
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df=select(df, -id)

## Visualize

plot(df$Recency)

## Model-based clustering: recency, frequency, and monetary

cluster1 = Mclust(df,verbose=FALSE)
summary(cluster1)
#> ---------------------------------------------------- 
#> Gaussian finite mixture model fitted by EM algorithm 
#> ---------------------------------------------------- 
#> 
#> Mclust EVV (ellipsoidal, equal volume) model with 9 components: 
#> 
#>  log-likelihood     n df     BIC     ICL
#>         -151784 10000 81 -304313 -314488
#> 
#> Clustering table:
#>    1    2    3    4    5    6    7    8    9 
#>  515 2472 1394 2530 1375  167  952  148  447
plot(cluster1, what="density")

a. Implement hierarchical clustering.

  • Describe any pre-processing steps you took (e.g., scaling, distance metric)
  • State the linkage method you used with justification.
  • Show the resulting dendrogram
  • State the number of segments/clusters you used with justification.
  • Using your segmentation, are customers 1 and 100 in the same cluster?

I scaled the data features (mean = 0, SD =1) so that the monetary value feature didn’t dominate the linkage and distance metrics. For the distance metric, I used Euclidean distance, since it is the easiest to interpret for a layman. Although Euclidean distance should be avoided for frequency metrics alone, I believe it is a usetful metric for monetary value and recency in order to segment shoppers. I used complete linkage, which I perceive as a more conservative clustering approach, primarily because it resulted in a more evenly distributed dendrogram that has a slowing marginal distance increase after height 3. I decided to cut my dendrogram at height 4.775, yielding five clusters (K=5), because my elbow plot suggested only marginal decreases in height for additional clusters thereafter. Given this segmentation, customers 1 and 100 are not in the same cluster (customer 1 is in cluster 1, while customer 100 is in cluster 2).

## Scaling and distance

df_scaled=scale(df) # Sd=1, mean of 0

## Hierarchical cluster

dX=dist(df_scaled, method="euclidean")
hc=hclust(dX, method="complete")
plot(hc)

## Elbow plot

tibble(height = hc$height, K = row_number(-height)) %>%
  ggplot(aes(K, height)) +
  geom_line() +
  geom_point(aes(color = "black")) +
  scale_color_identity() +
  coord_cartesian(xlim=c(1, 50), ylim=c(0, 15))

## Find the elbow height/K

test=tibble(height=hc$height,
            K=row_number(-height))
test[(test$K==5),]

## Find membership vector and compare membership (K) of 100th and 1st customer in dataset.

yhat = cutree(hc, k=5)
result = tibble(membership=yhat) 
result[100,1]==result[1,1]

b. Implement k-means.

  • Describe any pre-processing steps you took (e.g., scaling)
  • State the number of segments/clusters you used with justification.
  • Using your segmentation, are customers 1 and 100 in the same cluster?

Again, I scaled the data features (mean = 0, SD =1) so that the monetary value feature didn’t dominate the clustering process. I ran kmeans clustering for 1:25 clusters. I decided to choose four clusters (K=4) because the decrease in SSE thereafter was marginal, striking a balance between variance and bias. Customers 1 and 100 were not in the same cluster utilizing k-means.

## K means

#-- Run kmeans for multiple K
Kmax = 25
SSE = numeric(Kmax) 
for(k in 1:Kmax){
  km = kmeans(df_scaled, centers=k, nstart=25)
  SSE[k] = km$tot.withinss
}
#-- Plot results

plot(1:Kmax, SSE, type='o', las=1, xlab="K")
## Customer comparison

km = kmeans(df_scaled, centers=4, nstart=25)
result=tibble(membership=km$cluster)
result[100,1]==result[1,1]

c. Implement model-based clustering

  • Describe any pre-processing steps you took (e.g., scaling)
  • State the number of segments/clusters you used with justification.
  • Describe the best model. What restrictions are on the shape of the components?
  • Using your segmentation, are customers 1 and 100 in the same cluster?

Again, I scaled the data features (mean = 0, SD =1) so that the monetary value feature didn’t dominate the clustering process. The optimal number of clusters was eight (9 were tested by default), according to the mclust package This component count resulted in maximum BIC (log-likelihood interpretation of mclust). The best model was a “VVE” model, which corresponds to variable shape, variable volume, and equal orientation of ellipsoidal components. Using this segmentation, customers 1 and 100 were not in the same cluster.

d. Discuss how you would cluster the customers if you had to do this for your job. Do you think one model would do better than the others?

I think my choice of cluster modeling approach would depend on the type of customer data I was analyzing, as well as the goal of my analysis. For rapid decision making on highly discrete data, I would likely use k-means clustering or hierarchical clustering due to their hard classification approach. For more continuous data, such as churn rates, and in less time constrained environments, I would likely prefer model-based clustering, implementing the EM algorithm for softer classifications of my customer data.

plot(cluster1, what="classification")

Problem 2: Poisson Mixture Model

The pmf of a Poisson random variable is: \[\begin{align*} f_k(x; \lambda_k) = \frac{\lambda_k^x e^{-\lambda_k}}{x!} \end{align*}\]

A two-component Poisson mixture model can be written: \[\begin{align*} f(x; \theta) = \pi \frac{\lambda_1^x e^{-\lambda_1}}{x!} + (1-\pi) \frac{\lambda_2^x e^{-\lambda_2}}{x!} \end{align*}\]

a. What are the parameters of the model?

image1

b. Write down the log-likelihood for \(n\) independent observations (\(x_1, x_2, \ldots, x_n\)).

image2

c. Suppose we have initial values of the parameters. Write down the equation for updating the responsibilities.

image2

d. Suppose we have responsibilities, \(r_{ik}\) for all \(i=1, 2, \ldots, n\) and \(k=1,2\). Write down the equations for updating the parameters.

image3

image3

e. Fit a two-component Poisson mixture model, report the estimated parameter values, and show a plot of the estimated mixture pmf for the following data:

#-- Run this code to generate the data
set.seed(123)             # set seed for reproducibility
n = 200                   # sample size
z = sample(1:2, size=n, replace=TRUE, prob=c(.25, .75)) # sample the latent class
theta = c(8, 16)          # true parameters
y = ifelse(z==1, rpois(n, lambda=theta[1]), rpois(n, lambda=theta[2]))
  • Note: The function poisregmixEM() in the R package mixtools is designed to estimate a mixture of Poisson regression models. We can still use this function for our problem of pmf estimation if it is recast as an intercept-only regression. To do so, set the \(x\) argument (predictors) to x = rep(1, length(y)) and addintercept = FALSE.
    • Look carefully at the output from this model. The beta values (regression coefficients) are on the log scale.
library(mixtools)

?poisregmixEM()
results=poisregmixEM(y=y,x=rep(1,length(y)),addintercept=FALSE)
#> number of iterations= 46
summary(results)
#> summary of poisregmixEM object:
#>          comp 1   comp 2
#> lambda 0.728314 0.271686
#> beta1  2.781594 2.064644
#> loglik at estimate:  -611.2
plot(x=0:10, dpois(0:10, lambda=results$lambda))

f. 2 pts Extra Credit: Write a function that estimates this two-component Poisson mixture model using the EM approach. Show that it gives the same result as part e.

  • Note: you are not permitted to copy code. Write everything from scratch and use comments to indicate how the code works (e.g., the E-step, M-step, initialization strategy, and convergence should be clear).
  • Cite any resources you consulted to help with the coding.

ADD SOLUTION HERE